This work builds on my first (and failed) attempt to estimate how many ratings one needs to sample to obtain a dependable measure of a certain population. Readers are therefore advised to catch up on my previous work before diving into this smaller project.
As the unconventional approach did not yield the desired results, we will revert to the more “old school” methodology of an a priori power analysis. That is, estimating the minimum sample size needed to detect a given effect size (or larger) at a set significance level and type II error rate (false negatives).
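For orientation, a conventional a priori power analysis looks like the following minimal sketch using the `pwr` package; the effect size, alpha, and power values are illustrative assumptions, not the ones used in this work:

```r
# Illustrative a priori power analysis for a two-sample t-test:
# given an assumed effect size (Cohen's d), alpha, and desired power,
# pwr solves for the minimum n per group.
library(pwr)

pwr.t.test(d = 0.5,          # assumed medium effect size
           sig.level = 0.05, # alpha (type I error rate)
           power = 0.80,     # 1 - beta (type II error rate)
           type = "two.sample")
```

Our problem differs in that the “test” is a polarization measure compared against a threshold, so we estimate power by simulation instead.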
Such an endeavor, however, confronts us with two problems:
To solve these problems, the following proposals were adopted for this work:
While most of the previous distributions were reused (with the same colors as last time), some adjustments were made. Notably, we have transitioned to an 11-point Likert scale, which our group has chosen for the upcoming study. Here are the distributions with some comments:
It is important to note that while the normal and strong polarization distributions represent no effect and a strong effect respectively, the small and rare distributions (the latter placing only 5% of the population on the opposite extreme) both illustrate small effects in the population. Ranking one above the other depends on how polarization itself is interpreted and operationalized (e.g. asymmetry, distance, agreement, etc.).
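As a rough illustration (the weights below are my guesses, not the exact distributions used in this work), the four populations could be encoded as probability vectors over the 11 scale points:

```r
# Illustrative encodings of the four populations on an 11-point scale;
# the exact weights used in the study differ.
scale_points <- 0:10

pop <- list(
  none   = dnorm(scale_points, mean = 5, sd = 2),                              # unimodal, no effect
  small  = 0.8 * dnorm(scale_points, 4, 2) + 0.2 * dnorm(scale_points, 9, 1),  # mild second mode
  rare   = 0.95 * dnorm(scale_points, 3, 1.5) + 0.05 * (scale_points == 10),   # tiny extreme group
  strong = 0.5 * dnorm(scale_points, 1, 1) + 0.5 * dnorm(scale_points, 9, 1)   # two clear poles
)
pop <- lapply(pop, function(p) p / sum(p))  # normalize to proper probabilities
```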
For this work, we will use 20 different sample sizes:
## [1] 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190
## [20] 200
For each sample size, we replicate the random draw 200 times.
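In code, the simulation grid might be set up as follows (variable names are mine; `scale_points` and `pop` are reused from the sketch above):

```r
# Simulation grid
sample_sizes <- seq(10, 200, by = 10)  # the 20 sizes listed above
n_reps <- 200                          # replications per sample size

# One replicate: draw n ratings from a given population distribution
draw_sample <- function(n, prob) {
  sample(scale_points, size = n, replace = TRUE, prob = prob)
}
```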
As mentioned in the introduction, we will use a threshold for each polarization measure to determine whether polarization is present. Luckily, the BC already has an established threshold of \(5/9 = 0.\overline{5}\).
For the two other measures, previous results were consulted.
The realization that the polarization measure yields a low value for the rare distribution (even lower than for the normal distribution, since it uses a weighted sum and thus discounts small groups) has a crucial implication: setting a low threshold in order to identify rare distributions as polarized would also net us too many false positives (e.g. classifying normal distributions as polarized). For reference, group divergence values in the previous simulation ranged from .24 to .9, while the polarization measure ranged from .1 to .8. Hence the thresholds for group divergence and polarization were both (arbitrarily) set at 0.5.
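Collected in one place, the cutoffs might look like this; only the BC value has an established basis, the other two are the arbitrary choices discussed above:

```r
# Classification cutoffs for the three polarization measures
thresholds <- c(
  bc           = 5/9,  # established bimodality cutoff
  divergence   = 0.5,  # arbitrary, given the observed range .24-.9
  polarization = 0.5   # arbitrary, given the observed range .1-.8
)
```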
To summarize: four different population distributions were created to serve as our effect sizes, and the thresholds function as a proxy for deciding whether we accept or reject that a sampled distribution is polarized (analogous to the alpha level and p-values). We then use the different effect sizes to estimate the power.
Again, this is what our sample matrix looks like, adopting a staircase-like shape. Using this method saves time and prevents errors.
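A minimal sketch of how such a staircase matrix could be built, assuming one row per (size, replicate) pair with NA padding, and reusing `sample_sizes`, `n_reps`, `draw_sample`, and `pop` from the sketches above:

```r
# Staircase-shaped matrix for a single population; shorter samples are
# padded with NA up to the largest size. Repeating this over all four
# populations yields the 16,000 samples.
sample_matrix <- do.call(rbind, lapply(sample_sizes, function(n) {
  t(replicate(n_reps, c(draw_sample(n, pop$none),
                        rep(NA, max(sample_sizes) - n))))
}))
dim(sample_matrix)  # 4000 rows (20 sizes x 200 reps), 200 columns
```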
After calculating the polarization measures for each of the 16,000 drawn samples, the measures were put into a data frame. As we can see, there are no missing values, indicating that each measure was calculated successfully.
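As a sketch of this step for the BC (one common sample formula, computed from skewness and excess kurtosis via the `psych` helpers; the group divergence and polarization measures are omitted here):

```r
library(psych)

# Sample bimodality coefficient from skewness and excess kurtosis
bimodality_coefficient <- function(x) {
  x  <- x[!is.na(x)]       # drop the NA padding
  n  <- length(x)
  g1 <- psych::skew(x)     # sample skewness
  g2 <- psych::kurtosi(x)  # excess kurtosis
  (g1^2 + 1) / (g2 + (3 * (n - 1)^2) / ((n - 2) * (n - 3)))
}

bc_values <- apply(sample_matrix, 1, bimodality_coefficient)
anyNA(bc_values)  # FALSE means every BC was computed successfully
```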
It is expected that a greater sample size yields a more accurate and reliable estimate of the population distribution. However, if the polarization measures in the population distributions are lower than our set thresholds, then those measures will never indicate polarization (except through random noise). To illustrate the expected pitfalls of the current approach, here is a fitting analogy in which the polarization measures function as prediction models in a classification problem:
Good models with high accuracy detect polarization when it is present (sensitivity) and classify distributions as not polarized when it is absent (specificity).
If a measure separates distributions with no polarization from those with high polarization clearly (e.g. the difference between their numerical values is large), we would correctly categorize them according to the threshold more often.
Following the logic of the previous point, if the measure itself is accurate, the exact threshold matters less. For example, if the polarization measure outputs .2 and .9 for two different distributions, then setting the threshold at .5 or .6 makes no difference to the categorization. Conversely, if the measure barely distinguishes the distributions, say .5 and .6, we have far less margin for error in threshold selection, as the threshold must fall between those values to categorize them correctly.
Moving the threshold may increase true positives at the expense of false positives, and vice versa. One can always set the threshold at .1 or .9, creating a low or high bar for classifying polarization (i.e. taking a liberal or conservative approach).
Taking all of these points into account, the power analysis results hinge mainly on the polarization measures themselves and on how appropriately the thresholds were set.
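Given the measures and thresholds, power is then simply the share of replicates that clear the cutoff. A hedged sketch, where `measures_df` and its column names are assumptions about how the results are stored:

```r
library(dplyr)

# Power = proportion of replicates classified as polarized, i.e.
# whose measure meets or exceeds its threshold.
power_tbl <- measures_df |>
  group_by(distribution, sample_size) |>
  summarise(
    power_bc  = mean(bc >= thresholds["bc"]),
    power_div = mean(divergence >= thresholds["divergence"]),
    power_pol = mean(polarization >= thresholds["polarization"]),
    .groups = "drop"
  )
```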
| Population Distribution | Bimodality Coefficient | Group Divergence | Polarization |
|---|---|---|---|
| None | 0.345 | 0.287 | 0.152 |
| Small Pol. | 0.736 | 0.377 | 0.359 |
| Rare | 0.766 | 0.762 | 0.185 |
| Strong Pol. | 0.884 | 0.804 | 0.823 |
Looking over the results, the measures of group divergence and polarization were of no help. Two possible explanations (both already touched on above) come to mind:
The thresholds were not optimally set (e.g. with a better-placed threshold we would be more likely to find polarization, at the cost of more false positives). For example, the threshold for group divergence was too high, so it never categorized the small polarization distribution as polarized. In effect, we constrained the effect size to be at least 0.5, which may be too restrictive for this measure.
The measures themselves have difficulties distinguishing polarization (or at least polarization as I defined it). Take, for example, the rare distribution, which I defined as truly polarized, and compare it with the non-polarized distribution: their values are near identical. As a reminder, the polarization measure uses weighted sums, so it ignores the minority group in the rare distribution and focuses on the bigger group on the left, which just happens to be a skewed normal distribution.
On the other hand, the BC delivers a totally acceptable and logical output:
All polarized distributions are categorized as polarized, whereas the distribution without polarization is categorized as not polarized.
With increased sample size, we have more power.
The strongly polarized distribution reaches a high power level the fastest, whereas the small and rare polarizations are slower due to their implied effect sizes. Bear in mind that the rare distribution needs bigger sample sizes so that the probability of sampling from the minority group is sufficiently high.
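Such power curves can be drawn directly from the summary table sketched earlier (`power_tbl` and its columns are the same assumed names):

```r
library(ggplot2)

# One power curve per population, with the conventional 80% line
ggplot(power_tbl, aes(sample_size, power_bc, colour = distribution)) +
  geom_line() +
  geom_hline(yintercept = 0.8, linetype = "dashed") +
  labs(x = "Sample size", y = "Power (BC)")
```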
On a more thorough investigation, however, the simulation for the BC indicates that sample sizes of 20 and 30 are enough to reach 80% power for the small polarization and rare distributions respectively. Interpreting this result favorably, we have only found the bare minimum sample size needed (and even this interpretation is shaky, as the simulation uses random draws under perfect conditions, which does not translate well to our study). Interpreted more realistically (and thus more pessimistically), the results suggest that this approach (again) does not give a satisfactory answer to the problem at hand. Sample sizes of 20 and 30 are very small and intuitively lead to unreliable outcomes; using such small samples to infer how a population thinks or feels about certain risks is out of the question.
Only three measures of polarization were used, and thus not every aspect of polarization was covered.
The assumption that the rare distribution is, in fact, polarized may not hold in the study itself, as outliers, misinterpretations, or even a mouse slip can produce such a distribution, which we would then mistakenly treat as polarized.
The thresholds were set somewhat arbitrarily and therefore may not hold up in practice. One might be tempted to tune the thresholds to an optimum with the fewest false positives and false negatives, but such a threshold would only fit the distributions seen here and would not generalize to others. As such, I refrain from finding the optimum, as it would not yield any additional insight toward our goal.
Likewise, one could set the threshold for the BC even higher, thus filtering out ambivalent distributions that barely make the cutoff. This would increase the required sample size, but we would get a more reliable and hopefully more realistic approximation of how many samples need to be drawn.
As in all classification problems, polarization has no clear-cut boundary and should be understood as a gradient, with some risks being more polarized than others. A threshold implies that distributions just short of it are significantly less polarized than those just above it, which could not be further from the truth.
In this simulation, samples were drawn at random. Our actual sampling procedure may not be as random (perhaps only those at one pole answer our questionnaires, among other sampling difficulties).
It could also be that our “small polarization” distribution still represents a large effect size by real-world standards. If so, real effects would be smaller, and such a small sample would simply lack the power to detect polarization.
This approach failed to deliver a clear-cut answer on the sample size needed to detect polarization when it is present.
Only a lower limit was found, and for the numerous reasons outlined in the limitations, it is advised to increase the number well beyond it.
For those curious about what would happen if the thresholds were changed, feel free to download the code and change the numbers yourself. I have put the relevant variables in the first chunk to make this practical.
agrmt (Ruedin D (2023). agrmt: Calculate Concentration and Dispersion in Ordered Rating Scales. R package version 1.42.12, https://CRAN.R-project.org/package=agrmt.)
doParallel (Microsoft Corporation, Weston S (2022). doParallel: Foreach Parallel Adaptor for the ‘parallel’ Package. R package version 1.0.17, https://CRAN.R-project.org/package=doParallel.)
foreach (Microsoft, Weston S (2022). foreach: Provides Foreach Looping Construct. R package version 1.5.2, https://CRAN.R-project.org/package=foreach.)
knitr (Xie Y (2023). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.45, https://yihui.org/knitr/.)
psych (William Revelle (2023). psych: Procedures for Psychological, Psychometric, and Personality Research. Northwestern University, Evanston, Illinois. R package version 2.3.9, https://CRAN.R-project.org/package=psych.)
RColorBrewer (Neuwirth E (2022). RColorBrewer: ColorBrewer Palettes. R package version 1.1-3, https://CRAN.R-project.org/package=RColorBrewer.)
rmarkdown (Allaire J, Xie Y, Dervieux C, McPherson J, Luraschi J, Ushey K, Atkins A, Wickham H, Cheng J, Chang W, Iannone R (2023). rmarkdown: Dynamic Documents for R. R package version 2.25, https://github.com/rstudio/rmarkdown.)
tidyverse (Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.)
visdat (Tierney N (2017). “visdat: Visualising Whole Data Frames.” JOSS, 2(16), 355. doi:10.21105/joss.00355 https://doi.org/10.21105/joss.00355, http://dx.doi.org/10.21105/joss.00355.)
ChatGPT 3.5 (OpenAI, 2023, https://chat.openai.com/chat):